Markov Chain Monte Carlo II

Chelsea Parlett-Pelleriti

Metropolis-Hastings Review

MCMC Review

  1. Propose a candidate state \(x_*\) from some proposal distribution \(J\left (x_* | x_t \right )\)

  2. Accept the proposal with probability \(A = \min \left (1, \frac{f(x_*)}{f(x_t)} \times \frac{J(x_t | x_*)}{J(x_* | x_t)} \right )\)

Note: if \(J\) is symmetric (\(J\left (x_* | x_t \right ) = J\left (x_t | x_* \right )\)), the bias correction term \(\frac{J(x_t | x_*)}{J(x_* | x_t)}\) equals 1 and the rule simplifies to the Metropolis algorithm

MH Algorithm Review

metropolis_step <- function(x, sigma, target) {
  # generate proposal centered at the current state
  proposed_x <- rnorm(1, mean = x, sd = sigma)
  
  # calculate A (symmetric proposal, so no correction term)
  accept_prob <- min(1, target(proposed_x) / target(x))
  
  # accept/reject
  if (runif(1) <= accept_prob) proposed_x else x
}

MH Algorithm Review

target <- function(x) {
  exp(-x^2) * (2 + sin(5 * x) + sin(2 * x))
}

iter <- 1000

samp <- 0               # initial state
samples <- rep(0, iter) # samples[1] stays at the initial state
for (i in 2:iter) {
  samp <- metropolis_step(samp, sigma = 1,
                          target = target)
  samples[i] <- samp
}

Target Distribution

Target Distribution

MH Hyperparameters

  • Proposal Width: how far from the current value the proposal is allowed to jump

  • Burn-in: how many initial samples we discard while the chain converges to the stationary distribution

  • Thinning: keeping only every \(k\)-th sample to reduce autocorrelation
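In code, burn-in and thinning are just indexing into the vector of draws. A minimal sketch (the cutoffs of 200 burned draws and keeping every 5th are illustrative choices, not prescriptions, and `samples` here is a stand-in vector):

```r
# stand-in for the 1,000 MCMC draws generated earlier
samples <- rnorm(1000)

# burn-in: discard the first 200 draws (an illustrative cutoff)
burn_in <- 200
# thinning: keep every 5th remaining draw (also illustrative)
thin <- 5

kept <- samples[seq(from = burn_in + 1, to = length(samples), by = thin)]
length(kept)  # 160 draws remain
```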

Proposal Width

df <- data.frame(samples = samples, t = 1:length(samples))

ggplot(df, aes(x = t, y = samples)) + 
  geom_line() + 
  theme_minimal() + 
  labs(title = "Sample Trace Plot: Sigma = 1")
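The `mh()` function called on the next slides isn't defined above; here is a minimal sketch consistent with how it's called. The signature, `init` default, and returned columns are inferred from the calls, and `metropolis_step` is restated inline so the chunk runs on its own:

```r
# sketch of an mh() wrapper: run `iter` Metropolis steps and return
# a data frame ready for ggplot (signature inferred from usage)
mh <- function(iter, sigma, target, init = 0) {
  # restated from the earlier slide so this chunk is self-contained
  metropolis_step <- function(x, sigma, target) {
    proposed_x <- rnorm(1, mean = x, sd = sigma)
    accept_prob <- min(1, target(proposed_x) / target(x))
    if (runif(1) <= accept_prob) proposed_x else x
  }
  samples <- rep(init, iter)
  for (i in 2:iter) {
    samples[i] <- metropolis_step(samples[i - 1], sigma, target)
  }
  data.frame(samples = samples, t = seq_len(iter))
}
```

The returned data frame plugs directly into the same `ggplot(samples_df, aes(x = t, y = samples))` calls.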

Proposal Width

samples_df <- mh(1000, sigma = 0.25, target = target)

ggplot(samples_df, aes(x = t, y = samples)) + 
  geom_line() + 
  theme_minimal() + 
  labs(title = "Sample Trace Plot: Sigma = 0.25")

Proposal Width

samples_df <- mh(1000, sigma = 100, target = target)

ggplot(samples_df, aes(x = t, y = samples)) + 
  geom_line() + 
  theme_minimal() + 
  labs(title = "Sample Trace Plot: Sigma = 100")

Note: Why do you think the trace plot looks like this?

Burn-In

Burn-In

Thinning

Gibbs Sampling

GS Overview

  • Produces multidimensional (multiple-parameter) samples

  • Useful when the joint \(p \left (x_1, x_2, ... x_n \right)\) is tough to sample from, but each full conditional \(p \left (x_i | ... \right)\) is not

GS Overview

Pros: a special case of MH where the proposal is always accepted

Cons: need every full conditional \(p \left ( x_i | x_1, ..., x_{i-1}, x_{i+1}, ..., x_n \right)\) in a form you can sample from

GS Algorithm

  1. Start with \(\mathbf{x^{(0)}} = \left ( x_{1}^{(0)}, x_{2}^{(0)}, ..., x_{n}^{(0)} \right)\)

  2. Sample \(x_1^{(1)} \sim p \left ( x_1^{(1)} | x_2^{(0)},...,x_n^{(0)} \right)\)

  3. Sample \(x_2^{(1)} \sim p \left ( x_2^{(1)} | x_1^{(1)},...,x_n^{(0)} \right)\)

  4. Repeat for all \(x_i \in \mathbf{x}\), then iterate to generate \(\mathbf{x}^{(2)}, \mathbf{x}^{(3)}, ...\)
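For a concrete instance of these steps, assume a bivariate standard normal with correlation \(\rho\), whose full conditionals are known: \(x_1 | x_2 \sim N(\rho x_2, 1 - \rho^2)\), and symmetrically for \(x_2\). A minimal sketch:

```r
# Gibbs sampler for a bivariate standard normal with correlation rho
gibbs_bivariate <- function(iter, rho, init = c(0, 0)) {
  x <- matrix(0, nrow = iter, ncol = 2)
  x[1, ] <- init
  sd_cond <- sqrt(1 - rho^2)  # sd of each full conditional
  for (i in 2:iter) {
    # sample x1 given the most recent x2
    x[i, 1] <- rnorm(1, mean = rho * x[i - 1, 2], sd = sd_cond)
    # sample x2 given the just-updated x1
    x[i, 2] <- rnorm(1, mean = rho * x[i, 1], sd = sd_cond)
  }
  x
}

draws <- gibbs_bivariate(5000, rho = 0.8)
cor(draws[, 1], draws[, 2])  # should be close to 0.8
```

Note that each update conditions on the *newest* value of the other coordinate, which is what makes the proposal always acceptable.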

GS Algorithm

GS Algorithm

GS Algorithm

GS Algorithm

Hamiltonian Monte Carlo (HMC)

HMC Proposals

Problem: MH and GS work well, but MH proposals explore inefficiently (random-walk behavior) and GS full conditionals can be difficult to derive


Solution: use information about shape of target distribution to generate better proposals

HMC Proposals

\[ \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(x-\mu)^2}{2\sigma^2}} \]

HMC Proposals

HMC Proposals

  1. generate a random momentum vector (a “flick” of the frictionless particle)

  2. run a physics simulation that takes the current position \(x_0\) and momentum \(w_0\) and generates a new state \(\left ( x_T, w_T \right)\)

  3. accept with probability \(A = \min \left (1, \frac{\exp(-H(x_T, w_T))}{\exp(-H(x_0, w_0))} \right )\)

Note: HMC tends to produce states with similar energy (theoretically, energy is exactly conserved), so the acceptance rate is high

Detour: Energy

\[ H(x,w) = \underbrace{U(x)}_\text{potential energy} + \underbrace{K(w)}_\text{kinetic energy} \]

  • \(U(x)\): potential energy (high when the particle is on top of a “hill”)

  • \(K(w)\): kinetic energy (high when the particle is moving fast)
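Putting the pieces together for a 1-D standard normal target, where \(U(x) = x^2/2\), \(K(w) = w^2/2\), and the gradient of \(U\) is just \(x\). The leapfrog settings (`eps`, `L`) are illustrative, and this is a sketch rather than a production sampler:

```r
# one HMC proposal for a 1-D standard normal target
# U(x) = x^2 / 2 (potential), K(w) = w^2 / 2 (kinetic)
grad_U <- function(x) x  # derivative of U
H      <- function(x, w) x^2 / 2 + w^2 / 2

hmc_step <- function(x0, eps = 0.1, L = 20) {
  w0 <- rnorm(1)  # 1. random momentum "flick"
  x <- x0; w <- w0
  # 2. physics simulation: L leapfrog steps of size eps
  w <- w - eps / 2 * grad_U(x)
  for (l in 1:L) {
    x <- x + eps * w
    if (l < L) w <- w - eps * grad_U(x)
  }
  w <- w - eps / 2 * grad_U(x)
  # 3. accept with prob min(1, exp(-H_T) / exp(-H_0))
  if (runif(1) <= exp(H(x0, w0) - H(x, w))) x else x0
}

xs <- rep(0, 5000)
for (i in 2:5000) xs[i] <- hmc_step(xs[i - 1])
c(mean(xs), sd(xs))  # roughly 0 and 1
```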

HMC Proposals

Hyperparameters:

  • Leapfrog Step Size

  • Number of Leapfrog Steps

HMC Proposals

MCMC Simulation

HMC Acceptance

\[ A = \min \left (1, \frac{\exp(-H(x_T, w_T))}{\exp(-H(x_0, w_0))} \right ) \]

  • when energy \(H(x,w)\) stays the same or decreases, acceptance prob is high

  • when energy \(H(x,w)\) increases, acceptance prob is low because the proposed sample is unlikely

HMC Acceptance

HMC Acceptance

Go to the Whiteboard!

HMC Struggles

MCMC Simulation

MCMC Diagnostics

  • Effective Sample Size (ESS)

  • Trace plots/\(\hat{R}\) (r-hat)

  • (HMC) Divergent Transitions

Trace Plots + \(\hat{R}\)

Question: have our chains converged to the target distribution?

Answer:

  1. start chain from different values

  2. compare samples from chains

  3. should look like a fuzzy caterpillar (not kidding)

Trace Plots + \(\hat{R}\)

Trace Plots + \(\hat{R}\)

\(\hat{R}\) is a metric that summarizes similar information as a trace plot

\[ \hat{R} = \frac{\hat{V}}{W} \]

where \(\hat{V}\) is the pooled variance across all the chains, and \(W\) is the average within-chain variance.


Rule of Thumb: \(\hat{R} < 1.1\)
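A sketch of this computation for a matrix of draws with one column per chain, following the simple \(\hat{V}/W\) form above; modern software computes a split-chain version (and some references take a square root), so treat this as illustrative:

```r
# simple R-hat: pooled variance over average within-chain variance
# `chains` is an (iterations x n_chains) matrix, one column per chain
rhat <- function(chains) {
  n <- nrow(chains)
  W <- mean(apply(chains, 2, var))   # average within-chain variance
  B <- n * var(colMeans(chains))     # between-chain variance
  V_hat <- (n - 1) / n * W + B / n   # pooled variance estimate
  V_hat / W
}

# four well-mixed "chains" of independent draws: R-hat near 1
good <- matrix(rnorm(4000), ncol = 4)
rhat(good)
```

Chains stuck in different regions inflate the between-chain term \(B\), pushing \(\hat{R}\) above 1.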

Effective Sample Size

\[ N_{eff} = \frac{N}{1 + 2\sum_{t=1}^{\infty} \rho_t} \]

The number of independent random samples that would carry the same amount of information as our (autocorrelated) MCMC samples.

ESS measures the efficiency of the samples, but there’s nothing inherently wrong with a low ESS.

Rule of Thumb: ESS > no. chains * 100
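A sketch of the formula above for a single chain, estimating \(\rho_t\) with `acf()`. The infinite sum has to be truncated somewhere; stopping at the first negative autocorrelation estimate is one simple convention (real implementations use more careful truncation rules):

```r
# effective sample size for one chain from its autocorrelations
ess <- function(x) {
  n <- length(x)
  rho <- acf(x, lag.max = n - 1, plot = FALSE)$acf[-1]  # rho_1, rho_2, ...
  # truncate the sum at the first negative autocorrelation estimate
  cut <- which(rho < 0)[1]
  if (!is.na(cut)) rho <- rho[seq_len(cut - 1)]
  n / (1 + 2 * sum(rho))
}

# independent draws: ESS should be close to the actual sample size
ess(rnorm(2000))
```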

Divergent Transitions

In HMC, a divergent transition occurs when the simulated trajectory diverges from the true trajectory (measured by \(H(x_T, w_T) - H(x_0, w_0)\)), and the proposal is rejected

Lots of divergences often means biased samples because HMC cannot effectively explore one or more parts of the distribution.

Target Distribution Analysis 🍑

Advantages of Samples

Don’t calculate, simulate!

# X ~ N(0,1)
X <- rnorm(1e5, 0, 1) # n, mu, sd
Z <- X * 4.5 + 10
W <- X^2
print(paste0("Z: ", round(mean(Z),2), ", W: ", round(mean(W),2)))
[1] "Z: 10.01, W: 1"

MCMC Plots

MCMC Plots

MCMC Summary

 Family: gaussian 
  Links: mu = identity; sigma = identity 
Formula: bill_length_mm ~ bill_depth_mm 
   Data: penguins (Number of observations: 342) 
  Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
         total post-warmup draws = 4000

Regression Coefficients:
              Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept        55.10      2.50    50.26    59.95 1.00     4063     2774
bill_depth_mm    -0.65      0.14    -0.93    -0.37 1.00     4092     2775

Further Distributional Parameters:
      Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sigma     5.33      0.20     4.95     5.75 1.00     3920     3144

Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
[1] "0.14625 proportion of posterior samples for b_bill_depth_mm are >= 0.05"